fix(datasets): Add shuffle option to IidPartitioner #7385
WilliamLindskog wants to merge 7 commits into
Pull request overview
This PR enhances flwr_datasets by (1) adding optional shuffling to IidPartitioner while preserving the historical contiguous-shard default behavior, and (2) introducing public partition skew distance metrics (Hellinger and Jensen–Shannon) to quantify how partition label/target distributions differ from the full dataset.
Changes:
- Extend `IidPartitioner` with `shuffle`/`seed` and cache the shuffled dataset per instance for stable repeated loads.
- Add `compute_hellinger_distances` and `compute_jensen_shannon_distances` (with optional binning for continuous targets) plus test coverage.
- Update README and Sphinx docs to reflect the new `IidPartitioner` signature and new skew metrics.
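For readers unfamiliar with the two skew metrics, here is a minimal standard-library sketch of the textbook Hellinger and Jensen–Shannon distances between a partition's label distribution and the full dataset's. The function names `label_distribution`, `hellinger_distance`, and `jensen_shannon_distance` are hypothetical and chosen for illustration; this is not the PR's `compute_hellinger_distances` / `compute_jensen_shannon_distances` implementation, whose signatures and binning behavior are not shown here. The base-2 logarithm is an assumption that bounds the Jensen–Shannon distance by 1.

```python
import math
from collections import Counter


def label_distribution(labels, classes):
    """Return the relative frequency of each class in `labels` (hypothetical helper)."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in classes]


def hellinger_distance(p, q):
    """Textbook Hellinger distance between two discrete distributions, in [0, 1]."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base-2 logs, so in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, divergence))


# Compare a label-skewed partition against the full dataset.
classes = [0, 1]
full = label_distribution([0, 0, 0, 1, 1, 1], classes)
skewed = label_distribution([0, 0, 0], classes)
print(hellinger_distance(skewed, full), jensen_shannon_distance(skewed, full))
```

Both distances are 0 when the partition matches the full dataset exactly and approach 1 as the distributions stop overlapping, which is what makes them usable as a per-partition skew score.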
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `datasets/README.md` | Documents `IidPartitioner(shuffle, seed)` usage and introduces partition skew metrics in the library overview/quickstart. |
| `datasets/flwr_datasets/partitioner/iid_partitioner.py` | Adds `shuffle`/`seed` parameters and per-instance caching of the shuffled dataset used for sharding. |
| `datasets/flwr_datasets/partitioner/iid_partitioner_test.py` | Adds regression and determinism tests for default contiguous behavior and shuffled sharding. |
| `datasets/flwr_datasets/metrics/utils.py` | Implements Hellinger and Jensen–Shannon distance utilities (including optional binning) and related helpers. |
| `datasets/flwr_datasets/metrics/utils_test.py` | Adds unit tests validating distance values, binning behavior, and input validation. |
| `datasets/flwr_datasets/metrics/__init__.py` | Exposes the new metric functions as part of the public `flwr_datasets.metrics` API. |
| `datasets/docs/source/index.rst` | Updates feature list and `IidPartitioner` signature in docs landing page. |
| `datasets/docs/source/how-to-use-with-local-data.rst` | Adds guidance for shuffling sorted local datasets via `IidPartitioner(shuffle=True, seed=...)`. |
Update: I split this into the lower-friction review path.
Focused validation passed for the touched IID partitioner files.
```python
if not self._shuffle:
    return self.dataset
if self._shuffled_dataset is None:
    self._shuffled_dataset = self.dataset.shuffle(seed=self._seed)
```
does this mean we have two copies of the dataset?
Good question. Dataset.shuffle(...) returns another Hugging Face Dataset object with shuffled indices/cache metadata rather than eagerly duplicating all row data. So this keeps a second dataset object around, but it should not be a full in-memory copy of the underlying dataset. The cache here is intentional so repeated load_partition calls use the same shuffled order, especially when seed=None.
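The caching argument can be illustrated with a small standard-library sketch. `ShuffledView` and `CachedShufflePartitioner` are hypothetical stand-ins written for this example, not the Hugging Face `Dataset` or the PR's `IidPartitioner`; the view stores only a permutation of indices over the original rows, and the contiguous split arithmetic is an assumption made for illustration.

```python
import random


class ShuffledView:
    """Hypothetical stand-in for a shuffled dataset: keeps a permutation, not row copies."""

    def __init__(self, rows, seed=None):
        self._rows = rows  # reference to the original rows, no duplication
        self._indices = list(range(len(rows)))
        random.Random(seed).shuffle(self._indices)

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        return self._rows[self._indices[i]]


class CachedShufflePartitioner:
    """Hypothetical partitioner that shuffles lazily, once per instance."""

    def __init__(self, rows, num_partitions, shuffle=False, seed=None):
        self._rows = rows
        self._num_partitions = num_partitions
        self._shuffle = shuffle
        self._seed = seed
        self._shuffled = None  # filled on first use when shuffle=True

    def _source(self):
        if not self._shuffle:
            return self._rows
        if self._shuffled is None:
            # Without this cache, seed=None would reshuffle on every call.
            self._shuffled = ShuffledView(self._rows, self._seed)
        return self._shuffled

    def load_partition(self, partition_id):
        source = self._source()
        n = len(source)
        start = partition_id * n // self._num_partitions
        end = (partition_id + 1) * n // self._num_partitions
        return [source[i] for i in range(start, end)]


partitioner = CachedShufflePartitioner(list(range(10)), 3, shuffle=True, seed=None)
# Repeated loads of the same partition agree even though no seed was given.
print(partitioner.load_partition(0) == partitioner.load_partition(0))
```

The extra memory in this sketch is one list of integer indices rather than a second copy of the rows, and partitions stay disjoint because every `load_partition` call reads from the same cached permutation.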
What changed
- `shuffle` and `seed` parameters to `IidPartitioner`
- `shuffle=False`)
- `seed=None`

Issue/PR mapping
Validation
- `pytest`, `ruff`, `mypy`, and `black --check` on the touched IID partitioner files
- `git diff --check`